Beyond Least Squares: Using Likelihoods
BMLR Chapter 2
Setup
library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
Learning goals
Describe the concept of a likelihood
Construct the likelihood for a simple model
Define the Maximum Likelihood Estimate (MLE) and use it to answer an analysis question
Identify three ways to calculate or approximate the MLE and apply these methods to find the MLE for a simple model
Use likelihoods to compare models (next week)
What is the likelihood?
A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).
Unlike Ordinary Least Squares (OLS), likelihood methods do not require the responses to be independent, identically distributed, and normal (iidN).
Likelihoods are not the same as probability functions.
Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)
Likelihood: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value
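As a small illustration of this distinction, the sketch below uses a made-up binomial setting with 3 trials (an assumption for illustration, not the basketball data). The same R function, `dbinom()`, plays either role depending on which argument is held fixed. Note that `dbinom()` includes the binomial coefficient, a constant that does not depend on the parameter.

```r
# Probability function: fix the parameter (p = 0.5), vary the outcome y = 0, 1, 2, 3
prob_outcomes <- dbinom(0:3, size = 3, prob = 0.5)
prob_outcomes        # one probability per possible outcome
sum(prob_outcomes)   # probabilities over all outcomes sum to 1

# Likelihood: fix the data (y = 2 successes in 3 trials), vary the parameter p
p_values <- c(0.25, 0.50, 0.75)   # hypothetical candidate parameter values
lik_values <- dbinom(2, size = 3, prob = p_values)
lik_values           # one likelihood value per candidate p
sum(lik_values)      # likelihood values need not sum to 1
```

The second half treats the data as fixed and asks which candidate value of p makes that data most probable, which is the idea behind the MLE.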
Fouls in college basketball games
The data set 04-refs.csv includes 30 randomly selected NCAA men’s basketball games played in the 2009 - 2010 season.
We will focus on the variables foul1, foul2, and foul3, which indicate which team the 1st, 2nd, and 3rd fouls were called on, respectively.
- H: Foul was called on the home team
- V: Foul was called on the visiting team
We are focusing on the first three fouls for this analysis, but this could easily be extended to include all fouls in a game.
Fouls in college basketball games
refs <- read_csv("data/04-refs.csv")
refs %>% slice(1:5) %>% kable()
| game | date | visitor | hometeam | foul1 | foul2 | foul3 |
|---|---|---|---|---|---|---|
| 166 | 20100126 | CLEM | BC | V | V | V |
| 224 | 20100224 | DEPAUL | CIN | H | H | V |
| 317 | 20100109 | MARQET | NOVA | H | H | H |
| 214 | 20100228 | MARQET | SETON | V | V | H |
| 278 | 20100128 | SETON | SFL | H | V | V |
We will treat the games as independent in this analysis.
Different likelihood models
Model 1 (Unconditional Model): What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
Model 2 (Conditional Model):
- Is there a tendency for the referees to call more fouls on the visiting team or home team?
- Is there a tendency for referees to call a foul on the team that already has more fouls?
Ultimately we want to decide which model is better.
Exploratory data analysis
refs %>%
  count(foul1, foul2, foul3) %>%
  kable()
| foul1 | foul2 | foul3 | n |
|---|---|---|---|
| H | H | H | 3 |
| H | H | V | 2 |
| H | V | H | 3 |
| H | V | V | 7 |
| V | H | H | 7 |
| V | H | V | 1 |
| V | V | H | 5 |
| V | V | V | 2 |
There are
- 46 total fouls on the home team
- 44 total fouls on the visiting team
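These totals can be recovered from the table of sequence counts. The sketch below retypes the counts by hand in base R so it runs without the CSV file; the object names (`foul_counts`, `home_per_game`, and so on) are chosen here for illustration.

```r
# Counts of each foul sequence, copied from the exploratory table
foul_counts <- data.frame(
  foul1 = c("H", "H", "H", "H", "V", "V", "V", "V"),
  foul2 = c("H", "H", "V", "V", "H", "H", "V", "V"),
  foul3 = c("H", "V", "H", "V", "H", "V", "H", "V"),
  n     = c(3, 2, 3, 7, 7, 1, 5, 2)
)

# Number of home fouls (0 to 3) within each sequence
home_per_game <- rowSums(foul_counts[, c("foul1", "foul2", "foul3")] == "H")

# Weight each sequence by how many games showed it
home_total <- sum(home_per_game * foul_counts$n)
visitor_total <- sum((3 - home_per_game) * foul_counts$n)
c(home = home_total, visitor = visitor_total)
```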
Model 1: Unconditional model
What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
Likelihood
Let \(p_H\) be the probability the referees call a foul on the home team.
The likelihood for a single observation
\[Lik(p_H) = p_H^{y_i}(1 - p_H)^{n_i - y_i}\]
where \(y_i\) is the number of fouls called on the home team in game \(i\) and \(n_i\) is the number of fouls considered in that game.
(In this example, we know \(n_i = 3\) for all observations.)
Example
For a single game where the first three fouls are \(H, H, V\), then
\[Lik(p_H) = p_H^{2}(1 - p_H)^{3 - 2} = p_H^{2}(1 - p_H)\]
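The single-game contribution can be written as a small R function. This is a sketch; the name `game_lik` and its arguments are hypothetical and not part of the original slides.

```r
# Likelihood contribution of one game with y home fouls among the first n fouls
game_lik <- function(p_h, y, n = 3) {
  p_h^y * (1 - p_h)^(n - y)
}

# The H, H, V game (y = 2) evaluated at a few candidate values of p_H
candidates <- c(0.25, 0.50, 0.75)
hhv_lik <- game_lik(candidates, y = 2)
hhv_lik
```

Among these three candidates, the largest value tells us which one makes the H, H, V sequence most probable.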
Model 1: Likelihood contribution
| Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
|---|---|---|---|---|
| H | H | H | 3 | \(p_H^3\) |
| H | H | V | 2 | \(p_H^2(1 - p_H)\) |
| H | V | H | 3 | \(p_H^2(1 - p_H)\) |
| H | V | V | 7 | A |
| V | H | H | 7 | B |
| V | H | V | 1 | \(p_H(1 - p_H)^2\) |
| V | V | H | 5 | \(p_H(1 - p_H)^2\) |
| V | V | V | 2 | \((1 - p_H)^3\) |
Fill in A and B.
Model 1: Likelihood function
Because the observations (the games) are independent, the likelihood is
\[Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}\]
We will use this function to find the maximum likelihood estimate (MLE). The MLE is the value of \(p_H\) between 0 and 1 that makes the observed data most likely.
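Because every game contributes a factor of the same form, the product collapses to \(p_H\) raised to the total number of home fouls times \(1 - p_H\) raised to the total number of visitor fouls. The sketch below checks this numerically; the vector `y` is rebuilt by hand from the table of sequence counts, and the function names are illustrative.

```r
# Home fouls per game, expanded from the table of sequence counts (30 games)
y <- rep(c(3, 2, 2, 1, 2, 1, 1, 0), times = c(3, 2, 3, 7, 7, 1, 5, 2))

# Product of the 30 per-game contributions (takes a single value of p_h)
lik_product <- function(p_h) prod(p_h^y * (1 - p_h)^(3 - y))

# Collapsed form using the total home and total visitor fouls
lik_collapsed <- function(p_h) p_h^sum(y) * (1 - p_h)^sum(3 - y)

exponents <- c(home = sum(y), visitor = sum(3 - y))
exponents
all.equal(lik_product(0.4), lik_collapsed(0.4))
```

The exponents match the totals from the exploratory analysis, which is why the plotting code can use a single expression in the two totals.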
Visualizing the likelihood
p <- seq(0, 1, length.out = 100) # sequence of 100 values between 0 and 1
lik <- p^46 * (1 - p)^44
x <- tibble(p = p, lik = lik)
ggplot(data = x, aes(x = p, y = lik)) +
  geom_point() +
  geom_line() +
  labs(y = "Likelihood",
       title = "Likelihood of p_H")
Q: What is your best guess for the MLE, \(\hat{p}_H\)?
A. 0.489
B. 0.500
C. 0.511
D. 0.556
Finding the maximum likelihood estimate
There are three primary ways to find the MLE
✅ Approximate using a graph
✅ Numerical approximation
✅ Using calculus
Approximate MLE from a graph
Find the MLE using numerical approximation
Specify a finite set of possible values for \(p_H\) and calculate the likelihood for each value
# write an R function for the likelihood
ref_lik <- function(ph) {
ph^46 *(1 - ph)^44
}
# use the optimize function to find the MLE
optimize(ref_lik, interval = c(0,1), maximum = TRUE)
$maximum
[1] 0.5111132

$objective
[1] 8.25947e-28
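The "finite set of possible values" idea can also be sketched directly as a grid search, without `optimize()`. The grid spacing of 0.001 is an arbitrary choice made here; a finer grid gives a closer approximation.

```r
# Likelihood from the slides: 46 home fouls and 44 visitor fouls
ref_lik <- function(ph) {
  ph^46 * (1 - ph)^44
}

# Evaluate the likelihood on a finite grid of candidate values
p_grid <- seq(0, 1, by = 0.001)
lik_grid <- ref_lik(p_grid)

# The grid value with the largest likelihood approximates the MLE
p_hat_grid <- p_grid[which.max(lik_grid)]
p_hat_grid
```

The grid answer can only be as precise as the spacing allows, so it should agree with the `optimize()` result to about three decimal places.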
Find MLE using calculus
Find the MLE by taking the first derivative of the likelihood function.
This can be tricky because of the Product Rule, so we can maximize the log(Likelihood) instead. Because the logarithm is an increasing function, the same value maximizes the likelihood and log(Likelihood).
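The claim that both functions peak at the same value can be checked numerically. The sketch below maximizes the likelihood and the log-likelihood separately with `optimize()`; `ref_loglik` is a name introduced here. For comparison it also shows the sample proportion 46/90, which is the value the calculus route gives for this model.

```r
# Likelihood and log-likelihood for 46 home fouls and 44 visitor fouls
ref_lik <- function(ph) ph^46 * (1 - ph)^44
ref_loglik <- function(ph) 46 * log(ph) + 44 * log(1 - ph)

# Maximize each function over the interval (0, 1)
mle_lik <- optimize(ref_lik, interval = c(0, 1), maximum = TRUE)$maximum
mle_loglik <- optimize(ref_loglik, interval = c(0, 1), maximum = TRUE)$maximum

c(likelihood = mle_lik, log_likelihood = mle_loglik, sample_proportion = 46 / 90)
```

The two numerical answers may differ in the later decimal places because `optimize()` stops at a default tolerance, not because the maximizers differ.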
Since calculus is not a pre-req, we will forgo this quest.
Model 2: Conditional model
Is there a tendency for the referees to call more fouls on the visiting team or home team?
Is there a tendency for referees to call a foul on the team that already has more fouls?
Model 2: Likelihood contributions
- Now let’s assume fouls are not independent within each game. We will specify this dependence using conditional probabilities.
- Conditional probability: \(P(A|B) =\) Probability of \(A\) given \(B\) has occurred
Define new parameters:
\(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams
\(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team
\(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team
Model 2: Likelihood contributions
| Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
|---|---|---|---|---|
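| H | H | H | 3 | \((p_{H\vert N})(p_{H\vert H Bias})(p_{H\vert H Bias})\) |
| H | H | V | 2 | \((p_{H\vert N})(p_{H\vert H Bias})(1 - p_{H\vert H Bias})\) |
| H | V | H | 3 | \((p_{H\vert N})(1 - p_{H\vert H Bias})(p_{H\vert N})\) |
| H | V | V | 7 | A |
| V | H | H | 7 | B |
| V | H | V | 1 | \((1 - p_{H\vert N})(p_{H\vert V Bias})(1 - p_{H\vert N})\) |
| V | V | H | 5 | \((1 - p_{H\vert N})(1 - p_{H\vert V Bias})(p_{H\vert V Bias})\) |
| V | V | V | 2 | \((1 - p_{H\vert N})(1 - p_{H\vert V Bias})(1 - p_{H\vert V Bias})\) |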
Fill in A and B
Likelihood function
\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]
(Note: The exponents sum to 90, the total number of fouls in the data)
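One way to check the exponents in this likelihood is to walk through each foul sequence, track which team is ahead in fouls before each call, and tally the calls by state. The sketch below does this in base R from the table of sequence counts; all object names are illustrative.

```r
# Foul sequences and the number of games showing each one
patterns <- c("HHH", "HHV", "HVH", "HVV", "VHH", "VHV", "VVH", "VVV")
n_games  <- c(3, 2, 3, 7, 7, 1, 5, 2)

# Rows: state before the call; columns: team the foul was called on
counts <- matrix(0, nrow = 3, ncol = 2,
                 dimnames = list(c("N", "HBias", "VBias"), c("H", "V")))

for (i in seq_along(patterns)) {
  fouls <- strsplit(patterns[i], "")[[1]]
  lead <- 0  # home fouls minus visitor fouls so far in this game
  for (foul in fouls) {
    state <- if (lead == 0) "N" else if (lead > 0) "HBias" else "VBias"
    counts[state, foul] <- counts[state, foul] + n_games[i]
    lead <- lead + ifelse(foul == "H", 1, -1)
  }
}
counts
```

Each row of `counts` supplies the pair of exponents for one parameter: the `H` column for the probability and the `V` column for its complement.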
\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]
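The log-likelihood above can be maximized numerically over all three parameters at once. The sketch below uses `optim()` with the bounded `"L-BFGS-B"` method on the negative log-likelihood; the starting values, the bounds just inside 0 and 1, and the names `neg_loglik2` and `mle2` are choices made here rather than part of the original analysis.

```r
# Negative log-likelihood for Model 2
# theta = (p_{H|N}, p_{H|H Bias}, p_{H|V Bias})
neg_loglik2 <- function(theta) {
  p_n  <- theta[1]
  p_hb <- theta[2]
  p_vb <- theta[3]
  -(25 * log(p_n)  + 23 * log(1 - p_n) +
     8 * log(p_hb) + 12 * log(1 - p_hb) +
    13 * log(p_vb) +  9 * log(1 - p_vb))
}

# optim() minimizes, so minimizing the negative log-likelihood maximizes the likelihood
fit <- optim(
  par = c(0.5, 0.5, 0.5),
  fn = neg_loglik2,
  method = "L-BFGS-B",
  lower = rep(1e-6, 3),
  upper = rep(1 - 1e-6, 3)
)

mle2 <- setNames(fit$par, c("p_H|N", "p_H|HBias", "p_H|VBias"))
round(mle2, 3)
```

Since each parameter appears in its own pair of terms, the estimates should land close to the ratios 25/48, 8/20, and 13/22.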
Q: If fouls within a game are independent, how would you expect \(\hat{p}_H\), \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H\vert V Bias}\) to compare?
\(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H \vert V Bias}\)
\(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert V Bias}\)
\(\hat{p}_{H\vert V Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert H Bias}\)
They are all approximately equal.
Q: If there is a tendency for referees to call a foul on the team that already has more fouls, how would you expect \(\hat{p}_H\) and \(\hat{p}_{H\vert H Bias}\) to compare?
\(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\)
\(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\)
They are approximately equal.
Acknowledgements
These slides are based on content in BMLR: Chapter 2 - Beyond Least Squares: Using Likelihoods
Initial versions of the slides are by Dr. Maria Tackett, Duke University